item parameter
- North America > United States > Pennsylvania (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory
Cardoso, Lucas, Santos, Vitor, Filho, José Ribeiro, Prudêncio, Ricardo, Kawasaki, Regiane, Alves, Ronnie
Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to characterize and guide the partitioning of datasets in the model validation stage. The impact of IRT-informed partitioning strategies on the performance of several ML models was evaluated on four tabular datasets. The results demonstrate that IRT reveals an inherent heterogeneity of the instances and highlights the existence of informative subgroups of instances within the same dataset. Based on IRT, balanced partitions were created that consistently help to better understand the tradeoff between bias and variance of the models. In addition, the guessing parameter proved to be a determining factor: training with high-guessing instances can significantly impair model performance, resulting in cases with accuracy below 50%, while other partitions reached more than 70% on the same dataset.
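For readers unfamiliar with the guessing parameter mentioned in this abstract, the following is a minimal sketch of the standard three-parameter logistic (3PL) item characteristic curve, in which the guessing parameter c sets the lower asymptote of the success probability. The abstract does not state which parameterization the authors use, so the function name `irt_3pl` and the symbols `theta`, `a`, `b`, `c` are illustrative assumptions, not the paper's code.

```python
import math

def irt_3pl(theta, a, b, c):
    """Probability of a correct response under the standard 3PL model.

    theta: latent ability of the respondent (here, a classifier)
    a: item discrimination, b: item difficulty
    c: guessing parameter (lower asymptote of the curve)
    All names are illustrative assumptions, not taken from the paper.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c = 0 the curve passes through 0.5 at theta == b;
# a positive c raises the floor, so even low-ability respondents
# succeed with probability at least c.
p_no_guess = irt_3pl(0.0, 1.0, 0.0, 0.0)
p_guess = irt_3pl(0.0, 1.0, 0.0, 0.2)
```

Under this form, an instance with a high c can be answered correctly with little ability, which is one way to read why the abstract singles out high-guessing instances.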
- North America > United States (0.05)
- Asia > Middle East > Jordan (0.05)
- South America > Brazil > São Paulo (0.04)
- (2 more...)
Federated Item Response Theory Models
Zhou, Biying, Luo, Nanyu, Ji, Feng
Item Response Theory (IRT) models have been widely used to estimate respondents' latent abilities and calibrate items' difficulty. Traditional IRT estimation requires all individual raw response data to be centralized in one place, thus potentially causing privacy issues. Federated learning is an emerging field in computer science and machine learning with added features of privacy protection and distributed computing. To integrate the advances from federated learning with modern psychometrics, we propose a novel framework, Federated Item Response Theory (FedIRT), to enable estimating traditional IRT models with additional privacy, allowing estimation in a distributed manner without losing estimation accuracy. Our numerical experiments confirm that FedIRT achieves statistical accuracy similar to standard IRT estimation using popular R packages, while offering critical advantages: privacy protection and reduced communication costs. We also validate FedIRT's utility through a real-world exam dataset, demonstrating its effectiveness in realistic educational contexts. This new framework extends IRT's applicability to distributed settings, such as multi-school assessments, without sacrificing accuracy or security. To support practical adoption, we provide an open-source R package, FedIRT, implementing the framework for the two-parameter logistic (2PL) and partial credit models (PCM).
- North America > Canada > Ontario > Toronto (0.15)
- Oceania > Australia (0.04)
- North America > United States > Virginia (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Education > Curriculum > Subject-Specific Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness
Cardoso, Lucas, Santos, Vitor, Ribeiro, José, Kawasaki, Regiane, Prudêncio, Ricardo, Alves, Ronnie
Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm's ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.
- South America > Brazil > Pernambuco > Recife (0.04)
- North America > United States > Massachusetts > Middlesex County > Reading (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- Research Report (1.00)
- Workflow (0.68)
- Education (0.67)
- Leisure & Entertainment > Games (0.46)
BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration
Sharpnack, James, Hao, Kevin, Mulcaire, Phoebe, Bicknell, Klinton, LaFlair, Geoff, Yancey, Kevin, von Davier, Alina A.
In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.
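The item-selection idea described in this abstract can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model and a normal posterior over ability, neither of which the abstract confirms, and it omits the additional exposure-control randomization step. The names `fisher_info_2pl` and `select_item` are hypothetical and are not the authors' API.

```python
import math
import random

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def select_item(items, post_mean, post_sd, rng):
    """Thompson-sampling-style selection (illustrative assumption).

    Draw one ability value from an assumed normal posterior, then return
    the index of the item with the largest Fisher information at that draw.
    items: list of (a, b) tuples (discrimination, difficulty).
    """
    theta = rng.gauss(post_mean, post_sd)
    return max(range(len(items)),
               key=lambda i: fisher_info_2pl(theta, *items[i]))

# Hypothetical item bank: equal discrimination, difficulties -2, 0, 2.
bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
chosen = select_item(bank, post_mean=0.0, post_sd=1.0, rng=random.Random(0))
```

Because the ability value is sampled rather than fixed at the posterior mean, a wide posterior spreads selections across items of different difficulty, while a narrow posterior concentrates on the items that are most informative near the current estimate.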
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Reading (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning
Sharpnack, James, Mulcaire, Phoebe, Bicknell, Klinton, LaFlair, Geoff, Yancey, Kevin
Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the probability of a test taker getting the correct answer to a test item (i.e., question). Neural net extensions of these models, such as BertIRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two-stage inner loop, which trains a non-parametric AutoML grade model using item features followed by an item-specific parametric model. This greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high-stakes, online English proficiency test. We show that the resulting model is typically better calibrated, achieves better predictive performance, and produces more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models like BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibration of item parameters for CATs.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (5 more...)
- Research Report (0.82)
- Overview (0.68)
Standing on the shoulders of giants
Cardoso, Lucas Felipe Ferraro, Filho, José de Sousa Ribeiro, Santos, Vitor Cirilo Araujo, Frances, Regiane Silva Kawasaki, Alves, Ronnie Cley de Oliveira
Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance, without considering the complexity of the data or the quality of the hit. To overcome these limitations, recent research has introduced the use of psychometric metrics such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace, but complements classical metrics by offering a new layer of evaluation and observation of the fine behavior of models in specific instances. It was also observed that there is 97% confidence that the score from the IRT has different contributions from 66% of the classical metrics analyzed.
- South America > Brazil > Pará > Belém (0.04)
- Europe > France (0.04)
- North America > United States > District of Columbia > Washington (0.04)
Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis
Liu, Yunting, Bhandari, Shreya, Pardos, Zachary A.
Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., possessing the right psychometric properties). However, item calibration is time-consuming and costly, requiring a sufficient number of respondents for the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) and various combinations of them using sampling methods to produce responses with psychometric properties similar to human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated by LLM-Respondents correlate highly (e.g. > 0.8 for GPT-3.5) with their human-calibrated counterparts, and closely resemble the parameters of the human subset (e.g. 0.02 Spearman correlation difference). Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (7 more...)
$\texttt{metabench}$ -- A Sparse Benchmark to Measure General Ability in Large Language Models
Kipnis, Alex, Voudouris, Konstantinos, Buschoff, Luca M. Schulze, Schulz, Eric
Large Language Models (LLMs) vary in their abilities on a range of tasks. Initiatives such as the $\texttt{Open LLM Leaderboard}$ aim to quantify these differences with several large benchmarks (sets of test items to which an LLM can respond either correctly or incorrectly). However, high correlations within and between benchmark scores suggest that (1) there exists a small set of common underlying abilities that these benchmarks measure, and (2) items tap into redundant information and the benchmarks may thus be considerably compressed. We use data from $n > 5000$ LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with $d=28,632$ items in total). From them we distill a sparse benchmark, $\texttt{metabench}$, that has less than $3\%$ of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original $\textit{individual}$ benchmark score with, on average, $1.5\%$ root mean square error (RMSE), (2) reconstruct the original $\textit{total}$ score with $0.8\%$ RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is $r = 0.93$.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > Illinois > Cook County > Evanston (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Health & Medicine (0.68)
- Education (0.67)